Mixture modeling with component-dependent partitions

ABSTRACT

Described are variational Expectation Maximization (EM) embodiments for learning a mixture model using component-dependent data partitions, where the E-step is sub-linear in sample size while the algorithm still maintains provable convergence guarantees. Component-dependent data partitions into blocks of data items are constructed according to a hierarchical data structure comprised of nodes, where each node corresponds to one of the blocks and stores statistics computed from the data items in the corresponding block. A modified variational EM algorithm computes the mixture model from initial component-dependent data partitions and a variational R-step updates the partitions. This process is repeated until convergence. Component membership probabilities computed in the E-step are constrained such that all data items belonging to a particular block in a particular component-dependent partition behave in the same way. The E-step can therefore consider the blocks or chunks of data items via their representative statistics, rather than considering individual data items.

BACKGROUND

Probabilistic mixture modeling is an important machine learning technique that has been extensively used for the tasks of density modeling and clustering. For clustering, individual mixture components represent the clusters. Mixture modeling is generally used for clustering data, such as media data, documents, signal data, scientific observations or measurements, etc. The Expectation-Maximization (EM) algorithm is among the most popular methods that are used for this task. The EM algorithm iteratively updates a model estimate until convergence. In practice, an iteration of the EM algorithm for mixture model clustering includes an E-step which, given a current model estimate, calculates cluster-membership probabilities for each data item in order to construct sufficient statistics, followed by an M-step which generates a new model estimate from those statistics. Each E-step has a computational complexity of O(N*C), where N is the number of data cases (samples) and C is the number of mixture components (or clusters) in the model. For very large N and C, for example, Internet-scale data, the computational complexity of the EM algorithm can be prohibitive. Put another way, the EM algorithm does not scale well in N and does not scale well in C.

Techniques related to efficient variational EM mixture modeling are discussed below.

SUMMARY

The following summary is included only to introduce some concepts discussed in the Detailed Description below. This summary is not comprehensive and is not intended to delineate the scope of the claimed subject matter, which is set forth by the claims presented at the end.

Described are embodiments based on a variational Expectation Maximization (EM) framework, which alters the E-step by using component-dependent data partitions instead of individual data items. Consequently, the E-step is sub-linear in sample size while these embodiments still maintain provable convergence guarantees.

The data items are organized into a hierarchical data structure comprised of nodes, where each node corresponds to a block of the data and stores statistics that represent the data items in the corresponding block. A coarse partition of the data items, according to some level of granularity in the hierarchical data structure, is assigned to each component (cluster) in the mixture model. These component-dependent partitions may be different for different components, and within a particular partition the block sizes may even be different. A version of variational EM is then performed that constrains the cluster-membership probabilities computed in the E-step such that all data items belonging to a particular block in a particular component-dependent partition behave in the same way. The E-step can therefore be performed by considering the blocks or chunks of data items via their representative statistics, rather than by considering individual data items. Thereafter, the component-dependent partitions are selectively refined and the variational EM is applied again. This refinement process may continue until further refinements do not significantly change the clustering result.

Many of the attendant features will be explained below with reference to the following detailed description considered in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The present description will be better understood from the following detailed description read in light of the accompanying drawings, wherein like reference numerals are used to designate like parts in the accompanying description.

FIG. 1 shows a computing device running a program configured to perform a component-dependent EM (CD-EM) algorithm.

FIG. 2 shows a CD-EM algorithm embodiment.

FIG. 3 shows a simple example of an MPT.

FIG. 4 shows an algorithm for performing a CD-EM E-step.

FIG. 5 shows a refining step (R-step) algorithm.

FIG. 6 shows a detailed CD-EM algorithm incorporating an E-step algorithm, an M-step, and an R-step algorithm.

FIG. 7 shows a computer.

DETAILED DESCRIPTION

Described herein are variants of the EM algorithm that scale well in N and C, have provable convergence guarantees, and have no additional parameters to set or tune beyond the limit of convergence. The EM variant approach described herein involves gradually refining component-dependent partitions while running the algorithm. More specifically, within a variational EM framework, the variational distribution is constrained in a way that depends on the gradually refined partition(s) of the data into chunks (the variational distribution is an approximation to the true optimal posterior cluster-membership probabilities for the data items). Chunks may also be referred to as blocks.

Prior approaches have described a variational EM algorithm for large-scale mixture modeling (see Verbeek et al., "A variational EM algorithm for large-scale mixture modeling", ASCI, 2003; and Verbeek et al., "Accelerated EM-based clustering of large data sets", Data Mining and Knowledge Discovery, 13(3):291-307, 2006). These approaches are also based on an application of the variational EM framework by Neal et al. ("A view of the EM algorithm that justifies incremental, sparse, and other variants"). Within this framework, the variational distribution (an approximation to the true optimal solution) is constrained in a way that depends on the gradually refined partition(s) of the data into chunks. However, Verbeek's approach has all mixture components share the same data partition, which limits the attainable computational savings. Embodiments described herein extend this approach in a way that considers different data partitions for different mixture components in the model, yet stays within the variational EM framework. The approach improves how performance scales with respect to both N and C.

FIG. 1 shows a computing device 100 running a program 102 configured to perform a component-dependent EM (CD-EM) algorithm. The computing device 100 may in practice be a computer running the program 102 or a group of computers communicating by a network, backplane, bus, etc. to cooperatively execute the program 102 as a distributed application. An example computer is described with reference to FIG. 7. The program 102 receives or accesses a dataset 104 that includes a large number of data items 106. A data item 106 may be considered to be any piece of information stored in digital form on a computer.

Regarding the program 102 and the executing of machine or CPU instructions in accordance with the CD-EM algorithm, the original EM algorithm includes an E-step and an M-step (not to be equated with the E-step and M-step described here, which will sometimes be referred to as the CD E-step and the CD M-step). The CD-EM algorithm also has an E-step 108 and M-step 110, which will be described in detail below. The E-step 108 and the M-step 110 iterate as needed and then an R-step 112 is performed. The E-step 108 and M-step 110 loop with the R-step 112 until a stopping point is reached. The iterations continuously refine a mixture model 114, which can be used to cluster future data (as well as the data in dataset 104 used to construct the model). For example, some ungrouped data 115 is processed by the mixture model 114, which forms clusters 116 of the data. It is to be understood that this clustering is probabilistic (or soft) in the sense that a data item is assigned with a fraction to each cluster rather than a hard assignment to just one cluster.

FIG. 2 shows a CD-EM algorithm embodiment. The CD-EM algorithm performs the variational EM algorithm in a way that considers different partitions for different mixture components. First, the data items 106 in dataset 104 are grouped 120 into a hierarchical data structure. These groups will also be referred to as chunks. For example, the data items 106 may be divided into groups that are placed into the nodes of a KD-tree. Note that other types of hierarchies can be used, such as the data structure behind the spatial index in an SQL execution engine (e.g., an n-ary tree). Sufficient statistics are then cached 121 for each chunk. This will be described in detail below.
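To make the grouping 120 and caching 121 concrete, the following is a minimal sketch of one way such a hierarchy might be built, assuming real-valued data and Gaussian-style sufficient statistics (count, sum, and sum of outer products) cached per node; the names Node and build_tree and the median split are illustrative choices, not requirements of the embodiments.

```python
# Sketch: a KD-tree-like hierarchy whose nodes cache sufficient statistics
# for the block of data items they cover (illustrative field and function names).
import numpy as np

class Node:
    def __init__(self, points):
        self.n = len(points)              # |B|: number of data items in the block
        self.sum_x = points.sum(axis=0)   # cached sum of x over the block
        self.sum_xxT = points.T @ points  # cached sum of x x^T over the block
        self.children = []

def build_tree(points, min_block=50):
    """Recursively split along the widest dimension at the median."""
    node = Node(points)
    if len(points) > min_block:
        dim = np.argmax(points.max(axis=0) - points.min(axis=0))
        order = np.argsort(points[:, dim])
        half = len(points) // 2
        node.children = [build_tree(points[order[:half]], min_block),
                         build_tree(points[order[half:]], min_block)]
    return node

root = build_tree(np.random.randn(10_000, 2))   # e.g., 10,000 two-dimensional items
```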

A coarse partition of the data items, according to some level of granularity in the hierarchical data structure, is assigned 122 to each component (cluster) in the mixture model, where these component-dependent partitions may be different for different components. Within a particular partition the block sizes may also be different. While any initial component-dependent partitions can be used, in one embodiment all components start from the same partition, taken at the coarsest level in the hierarchical data structure that has at least as many nodes (or data blocks) as there are components in the mixture model.

Given the initial hierarchy with cached statistics, the variational EM framework (described below) is used to learn 123 a variational distribution of the data items 106. All data items 106 in a given chunk are treated as similar to each other when considering the associated mixture component. The granularity of the component-dependent partitions is selectively expanded 124 as optimal for individual mixture components.

If the granularity is sufficient 126 for the variational distribution to be a good approximation of the solution (that is, there is some convergence), then the process finishes 128 and the final mixture model, and in some cases the final clustering for the data items used to train the model (the final results 114), are saved, stored to disk, transmitted by network, etc., for use in solving any real world problem represented by such data items. Otherwise, the learning 123 and expanding 124 are repeated as necessary. Given the stored or transmitted final results 114 (a mixture model trained according to the data items), any real world entities, scientific observations, database records, etc. similar to the data items may now be clustered and subject to analysis, decision making, etc. using the saved final mixture model (final results 114). Media data represented by the data items may be indexed in a search engine based in part on the clustering information, and so on.

Tree-Consistent Partition

The number of possible partitions of a dataset of data items is super-exponentially large if no constraints are imposed. The partitions are therefore restricted by organizing the data items (to be referred to as "data" and "data points") using a tree structure, where a node in the tree represents the union of the data from its children. Individual data points are not actually stored in each node; rather, sufficient statistics for estimation operations are pre-computed and stored in the nodes. A tree-consistent partition is any partition where each block of data corresponds to one node for a cut (possibly across different levels) in the tree (discussed below with reference to FIG. 3). Although the number of tree-consistent partitions is still exponentially large, the tree structure makes efficient search from coarse to fine partitions possible. Any hierarchical decomposition of data is suitable for constructing tree-consistent partitions. In one embodiment a KD-tree is used. A KD-tree is created in O(N log N) time, where N is the number of data points.
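Building on the Node/build_tree sketch above, the following illustrates one way a tree-consistent partition (a cut through the tree) might be extracted; the size threshold is an arbitrary illustrative rule for where to place the cut, not part of the embodiments.

```python
# Sketch: select a "cut" of nodes so every data item falls in exactly one
# selected block; here the (illustrative) rule cuts at the first node whose
# block is small enough.
def cut_by_size(node, max_block=1000):
    if not node.children or node.n <= max_block:
        return [node]                       # this node becomes one block of the partition
    blocks = []
    for child in node.children:
        blocks += cut_by_size(child, max_block)
    return blocks

partition = cut_by_size(root)               # one tree-consistent partition of the data
```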

Variational EM

Given a set of independently and identically distributed data x = {x_1, . . . , x_N} (data items), a goal is to estimate the parameters θ = {η_{1:K}, π_{1:K}} in the K-component mixture model with log-likelihood function

$\mathcal{L}(\theta) = \sum_{n} \log \sum_{k} p(x_n \mid \eta_k)\,\pi_k. \qquad (1)$

For this task, a variational generalization of standard EM may be used, which maximizes a lower bound of ℒ(θ) through the introduction of a variational distribution q. It is assumed that the variational distribution factorizes in accordance with data points, i.e., q = Π_n q_n, where each q_n is an arbitrary discrete distribution over mixture components k = 1, . . . , K. Note that ℒ(θ) can be lower bounded by multiplying each p(x_n | η_k)π_k by q_n(k)/q_n(k) and applying Jensen's inequality to get

$\mathcal{L}(\theta) \;\geq\; \sum_{n}\sum_{k} q_n(k)\big[\log p(x_n \mid \eta_k)\,\pi_k - \log q_n(k)\big] \qquad (2)$

$\phantom{\mathcal{L}(\theta)\;} = \;\mathcal{L}(\theta) - \sum_{n} \mathrm{KL}\big(q_n \,\big\|\, p(\cdot \mid x_n, \theta)\big) \;\overset{\Delta}{=}\; \mathcal{F}(\theta, q), \qquad (3)$

where p(• | x_n, θ) defines the posterior distribution of cluster-membership probabilities and KL(q∥p) is the Kullback-Leibler (KL) divergence between q and p. The variational EM algorithm alternates the following two steps, i.e., coordinate ascent on ℱ(θ, q), until convergence.

E-step: $q^{t+1} = \arg\max_{q}\, \mathcal{F}(\theta^{t}, q)$,

M-step: $\theta^{t+1} = \arg\max_{\theta}\, \mathcal{F}(\theta, q^{t+1})$.

If q is not restricted in any form, the E-step produces q^(t+1) = Π_n p(• | x_n, θ^t), because the KL-divergence is the only term in equation (3) that depends on q. As pointed out elsewhere, the variational EM is in this case equivalent to the standard EM, and hence produces the maximum likelihood (ML) estimate. In the following, certain ways of restricting q to attain speedup over standard EM will be considered, implying that the minimum KL-divergence between q_n and p(• | x_n, θ) is not necessarily zero. Still, the variational EM defines a convergent algorithm, which instead approximates the ML estimate by optimizing a lower bound on the log-likelihood.
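For reference, the unrestricted case (where the E-step recovers the exact posteriors and variational EM reduces to standard EM) might be sketched as follows for Gaussian components; the O(N*K) E-step cost visible here is exactly what the component-dependent partitions below are designed to avoid. The function name and the initialization strategy are illustrative assumptions.

```python
# Sketch: standard EM for a Gaussian mixture, i.e., variational EM with an
# unrestricted q so that the E-step sets q_n = p(.|x_n, theta). Every E-step
# touches all N items for all K components.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, K, replace=False)]                 # initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)  # initial covariances
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: q[n, k] = p(k | x_n, theta), computed per data item (O(N*K)).
        logp = np.stack([multivariate_normal.logpdf(X, mu[k], Sigma[k]) + np.log(pi[k])
                         for k in range(K)], axis=1)
        q = np.exp(logp - logp.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate theta from the responsibilities.
        Nk = q.sum(axis=0)
        pi = Nk / N
        mu = (q.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (q[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma
```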

Chunky EM

The chunky EM algorithm is a type of variational EM algorithm. The chunky EM algorithm is described by Verbeek et al. in "A variational EM algorithm for large-scale mixture modeling" (ASCI, 2003) and in "Accelerated EM-based clustering of large data sets" (Data Mining and Knowledge Discovery, 13(3):291-307, 2006). With the so-called chunky EM, the variational distribution q = Π_n q_n is restricted according to tree-consistent partitions of the data. For a given partition, if data points x_i and x_j are in the same block, then q_i = q_j. The basis is that data points in the same block are somewhat similar and can be treated in the same way, which leads to computational savings in the computationally demanding E-step (further details are provided below in the section titled CD-EM: EFFICIENT VARIATIONAL E-STEP). In fact, if M is the number of blocks in a data partition, the E-step for chunky EM has cost O(KM), whereas in standard EM the cost is O(KN). The computational speedup can therefore be significant for M<<N.

The aforementioned speedup is traded off with tightness in the lower bound for the log-likelihood, according to the restrictiveness of the constraints on the variational distribution. To address this tradeoff, chunky EM starts from a coarse data partition and runs until convergence. The partition is then refined and the algorithm runs until convergence at a tighter lower bound. The hierarchical decomposition of partitions into a tree structure ensures that refinements will always produce a tighter bound, because restrictions on the variational distribution are gradually relaxed in this way. The refinement process, and hence chunky EM, stops when the refining of any partition will not significantly increase the lower bound.
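A chunky-EM-style E-step can be written at block level rather than item level. The sketch below assumes the block-average log-likelihoods g (an M-by-K array computed from cached sufficient statistics) and the mixing weights are already available; it illustrates the O(K*M) cost and the closed-form block responsibilities that reappear later around equation (11).

```python
# Sketch: block-level E-step with a single shared partition into M blocks.
# q[m, k] is proportional to pi_k * exp(g[m, k]); one row per block, not per item.
import numpy as np

def chunky_e_step(g, log_pi):
    """g[m, k]: average log p(x | eta_k) over block m; log_pi[k]: log mixing weight."""
    logits = g + log_pi                              # shape (M, K)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stabilization
    q = np.exp(logits)
    return q / q.sum(axis=1, keepdims=True)          # normalize each block's distribution
```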

Component-Dependent EM (CD-EM)

The remainder of this Detailed Description will describe the component-dependent EM approach. In chunky EM, all mixture components share the same data partition, regardless of how far apart the components are. However, the variation in membership probabilities for a component is greatest for data close to its center and insignificant far away. A computational speedup can therefore be gained if component-dependent data partitions are created. Let M_k be the number of data blocks in the partition for component k. The complexity of the E-step is then O(Σ_k M_k), compared to O(KM) in chunky EM. The log-likelihood can be lower-bounded equally well with Σ_k M_k significantly smaller than KM, resulting in a much faster E-step. Because the model maintains different partitions for different mixture components, it is referred to as the component-dependent EM algorithm (CD-EM).

CD-EM: Marked Partition Trees

For context and understanding, some properties of tree-consistent partitions will now be described. First, a marked partition tree (MPT) is defined as a simple encoding of all component-dependent partitions. Let ℬ_k be the set of data blocks in the tree-consistent partition for mixture component k. In the shared data partition tree used to generate the component-dependent partitions, mark the nodes corresponding to the data blocks in each ℬ_k by the component identifier k. The MPT is now obtained by pruning from the shared data partition tree all unmarked nodes without marked descendants.
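The marking-and-pruning step might look as follows, reusing the Node sketch above and adding illustrative marks and mpt_children fields; partitions is assumed, for illustration, to be a list with one set of chosen nodes (blocks) per component.

```python
# Sketch: build an MPT by marking each component's blocks and pruning unmarked
# nodes that have no marked descendants. Field names are illustrative.
def build_mpt(node, partitions):
    """partitions[k] is the set of tree nodes forming component k's partition."""
    node.marks = {k for k, blocks in enumerate(partitions) if node in blocks}
    kept = [build_mpt(child, partitions) for child in node.children]
    node.mpt_children = [child for child in kept if child is not None]
    if node.marks or node.mpt_children:
        return node        # keep: marked, or has a marked descendant
    return None            # prune: unmarked with no marked descendants
```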

FIG. 3 shows a simple example of an MPT 140. Trees 1-5 represent 5 mixture components with individual partitions indicated by the black nodes. The bottom-right figure is the corresponding MPT 140, where {•} indicates the component marks and a, b, c, d enumerate the leaf nodes.

The example in FIG. 3 is of interest because all nodes in the tree are marked. In general, an MPT may have unmarked nodes at any location above the leaves. For example, in chunky EM, the component-dependent partitions are the same for each mixture component. In that case, only the leaves in the MPT are marked, with each leaf marked by all mixture components. The following property for an MPT holds true, due to the fact that all component-dependent partitions are constructed with respect to the same data partition tree:

Corollary 1. Let 𝒯 denote an MPT. The marked nodes on a path from leaf to root in 𝒯 mark exactly one data block from each of the K component-dependent data partitions.
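Corollary 1 is what makes a leaf-to-root walk sufficient for recovering, per component, the unique block containing that leaf's data. A small illustrative helper over the MPT sketch above:

```python
# Sketch: walk one leaf-to-root path of the MPT and record, for each component
# k, the single node marked with k on that path (Corollary 1 guarantees there
# is exactly one per component).
def blocks_on_path(leaf_to_root):
    """leaf_to_root: list of MPT nodes from a leaf up to the root."""
    block_for_component = {}
    for node in leaf_to_root:
        for k in node.marks:
            block_for_component[k] = node
    return block_for_component
```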

CD-EM: Variational Distribution

The variational distribution q assigns the same variational membership probability to mixture component k for all data points in a component-dependent block B_k ∈ ℬ_k. That is,

$q_n(k) = q_{B_k} \quad \text{for all } x_n \in B_k, \qquad (4)$

which is denoted the component-dependent block constraint. As opposed to chunky EM, it is not assumed that the data partition ℬ_k is the same across different mixture components. This extra flexibility complicates the estimation of q in the E-step. Nevertheless, it will be seen that the estimation can still be performed in an efficient way.

Note that it should be ensured that for each data point the optimized q_n(•) is a probability distribution. That is, Σ_k q_n(k) = 1 and q_n(k) ≥ 0. For now, discussion will focus on the sum-to-one constraints (in the CD-EM: EFFICIENT VARIATIONAL E-STEP section below, it will be seen that the positivity constraints are automatically enforced). Let B_l denote the block of data associated with a leaf l of 𝒯, and let B_{1(l)}, . . . , B_{K(l)} denote the data blocks encountered on the path from leaf l to the root of 𝒯. Hence, B_{k(l)} is the data block B_k ∈ ℬ_k for which B_l ⊆ B_k. With the component-dependent block constraints in equation (4) and using Corollary 1, the sum-to-one constraints can now instead be written as one constraint per leaf,

$\sum_{k} q_{B_{k(l)}} = 1 \quad \text{for all leaves } l. \qquad (5)$

Note that multiple q_{B_{k(l)}} may represent the same q_{B_k} across different constraints in equation (5). In fact,

$q_{B_{k(l)}} = q_{B_k} \quad \text{for } l \in \{\,l : B_l \subseteq B_k\,\}, \qquad (6)$

implying that the constraints in equation (5) are intertwined according to a nested structure given by 𝒯. The closer a data block B_k is to the root of 𝒯, the more constraints simultaneously involve q_{B_k}, which is what causes the additional complications in the E-step compared to chunky EM.

Referring again to FIG. 3 and the MPT 140 therein,

$q_{B_{1(a)}} = q_{B_{1(b)}},\quad q_{B_{2(a)}} = q_{B_{2(b)}},\quad q_{B_{3(c)}} = q_{B_{3(d)}},\quad q_{B_{4(c)}} = q_{B_{4(d)}},\quad \text{and}\quad q_{B_{5(a)}} = q_{B_{5(b)}} = q_{B_{5(c)}} = q_{B_{5(d)}}.$

CD-EM: Efficient Variational E-Step

With the component-dependent block constraint on the variational distribution in equation (4), the lower bound ℱ(θ, q) can be reformulated as a sum of individual functions of q_{B_k}, ℱ(θ, q_{B_k}), as follows. Rearranging equation (2) using equation (4):

$\mathcal{F}(\theta, q) = \sum_{k}\sum_{B_k \in \mathcal{B}_k} |B_k|\, q_{B_k}\big[g_{B_k} + \log \pi_k - \log q_{B_k}\big] = \sum_{k}\sum_{B_k \in \mathcal{B}_k} \mathcal{F}(\theta, q_{B_k}), \qquad (7)$

having defined the block-specific geometric mean

$g_{B_k} = \left\langle \log p(x \mid \eta_k) \right\rangle_{B_k} = \sum_{x \in B_k} \log p(x \mid \eta_k)\,/\,|B_k|. \qquad (8)$

It is helpful to the computational savings that g_{B_k} can be calculated from pre-computed statistics, which is in fact the case for the large class of exponential family distributions.

Example: Let p(x | η_k) be an exponential family distribution

$p(x \mid \eta_k) = h(x)\exp\!\big(\eta_k^{\top} T(x) - A(\eta_k)\big), \qquad (9)$

where η_k is the natural parameter, h(x) is the reference function, T(x) is the sufficient statistic, and A(η_k) is the normalizing constant. Then g_{B_k} = ⟨log h(x)⟩_{B_k} + η_k^⊤⟨T(x)⟩_{B_k} − A(η_k), where ⟨log h(x)⟩_{B_k} and ⟨T(x)⟩_{B_k} are the statistics that are pre-computed for equation (8). In particular, if p(x | η_k) = 𝒩_d(μ_k, Σ_k), a Gaussian distribution, then

$h(x) = 1, \quad T(x) = (x,\; x x^{\top}), \quad \eta_k = \big(\Sigma_k^{-1}\mu_k,\; -\tfrac{1}{2}\Sigma_k^{-1}\big),$

$A(\eta_k) = \tfrac{1}{2}\big(d\,\log(2\pi) + \log|\Sigma_k| + \mu_k^{\top}\Sigma_k^{-1}\mu_k\big),$

and the statistics ⟨log h(x)⟩_{B_k} = 0 and ⟨T(x)⟩_{B_k} = (⟨x⟩_{B_k}, ⟨x x^⊤⟩_{B_k}) can be pre-computed.
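As an illustration of how g_{B_k} in equation (8) might be evaluated for the Gaussian case from the cached block statistics alone (the block mean ⟨x⟩_B and second moment ⟨x x^⊤⟩_B, e.g., derived from the sums cached in the Node sketch above), a minimal sketch:

```python
# Sketch: block-average Gaussian log-density computed only from cached block
# statistics; no individual data items are visited.
import numpy as np

def gaussian_g_block(m_B, S_B, mu, Sigma):
    """m_B = <x>_B, S_B = <x x^T>_B, (mu, Sigma) = component parameters."""
    d = len(mu)
    Sigma_inv = np.linalg.inv(Sigma)
    # <(x - mu)^T Sigma^{-1} (x - mu)>_B expanded in terms of m_B and S_B.
    quad = (np.trace(Sigma_inv @ S_B)
            - 2.0 * mu @ Sigma_inv @ m_B
            + mu @ Sigma_inv @ mu)
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)) + quad)
```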

The sum-to-one constraints in equation (5) are integrated into the lower bound in equation (7) by using the standard principle of Lagrange duality (see, e.g., Boyd and Vandenberghe, "Convex Optimization", Cambridge University Press, 2004):

$\mathcal{F}(\theta, q, \lambda) = \sum_{k}\sum_{B_k \in \mathcal{B}_k} \mathcal{F}(\theta, q_{B_k}) + \sum_{l} \lambda_l \Big(\sum_{k} q_{B_{k(l)}} - 1\Big),$

where λ = {λ_1, . . . , λ_L} are the Lagrange multipliers for the constraints in equation (5), with L the number of leaves. Recall the relationship between q_{B_k} and q_{B_{k(l)}} in equation (6). By taking the derivative of ℱ(θ, q, λ) with respect to q_{B_k} and setting it to zero, obtain

$q_{B_k}(\lambda) = \exp\!\left(\frac{\sum_{l:\,B_l \subseteq B_k} \lambda_l}{|B_k|} - 1\right)\pi_k \exp\!\left(g_{B_k}\right). \qquad (10)$

The optimal value q_{B_k}* = q_{B_k}(λ*) is now determined by solving the dual problem λ* = arg min_λ ℱ(θ, q(λ), λ). Note that the expression in equation (10) ensures that q_{B_k}(λ*) ≥ 0, and therefore the positivity constraint is automatically satisfied.

A particularly simple situation appears in the setting of chunky EM, where B_{k(l)} = B_l for all k = 1, . . . , K. In this case, Σ_{l':B_{l'} ⊆ B_k} λ_{l'} = λ_l, and substituting equation (10) into the Lagrange dual function ℱ(θ, q(λ), λ) reveals that the optimization of λ can be performed independently for each λ_l as λ_l* = |B_l|(1 − log Σ_k π_k exp(g_{B_k})). Inserting this value back into equation (10) now gives a simple closed-form solution for the optimal variational distribution,

$q_{B_k}^{*} = \pi_k \exp\!\left(g_{B_k}\right)/Z, \qquad (11)$

where Z = Σ_k π_k exp(g_{B_k}) is a normalizing constant.

This simple form of the optimization is not possible when the partitions are allowed to be different across mixture components, because this setup creates intertwined constraints according to a nested structure given by 𝒯, as shown above. However, a closed-form solution can still be obtained by carefully exploiting the nesting structure of the constraints.

FIG. 4 shows an algorithm 160 for performing a CD-EM E-step. Algorithm 160 describes a general solution to the CD-EM E-step (i.e., how the q functions are optimized), which first traverses 𝒯 bottom up, level by level, and gradually reduces the nested constraints to a single equation involving only one λ_l. After solving this equation, the algorithm then traverses 𝒯 top down, level by level, and gradually resolves the remaining Lagrange multipliers λ_l by back-substitution of previously resolved λ_l. The resolved Lagrange multipliers can now be used to find the desired variational distribution q. The next corollary is useful for the detailed description of the individual steps in algorithm 160 that follows below.

Corollary 2. Let i, i* ∈ {1, . . . , I}. If $\exp\!\left(\frac{\lambda_i}{|B_i|} - 1\right) D_i = \exp\!\left(\frac{\lambda_{i^*}}{|B_{i^*}|} - 1\right) D_{i^*}$, then $\lambda_i = |B_i|\left(\frac{\lambda_{i^*}}{|B_{i^*}|} + \log D_{i^*} - \log D_i\right)$.

Algorithm 160 has an initialization phase 162. The initialization phase 162 computes some initial statistics from the pre-computed sufficient statistics associated with the hierarchical data structure (e.g., in the nodes of the KD-tree). For the initialization phase 162, let V denote the set of nodes in the MPT 𝒯. When computing the solution for the variational distribution it is convenient to define three scalar values K_v, C_v and D_v (computed from the sufficient statistics) for each v ∈ V. Initially,

$K_v = 0, \qquad C_v = \sum_{k \in \mathcal{K}_v} \pi_k \exp\!\left(g_{B_k}\right), \qquad D_v = C_v, \qquad (12)$

where 𝒦_v denotes the set of mixture components whose data partitions have a data block associated with node v. For each v ∈ V, define the node distribution

$q_v = \sum_{k \in \mathcal{K}_v} q_{B_k} = \exp\!\left(\frac{\sum_{l:\,B_l \subseteq B_v} \lambda_l}{|B_v|} - 1\right) C_v, \qquad (13)$

where the second equality follows from equations (10) and (12).

Algorithm 160 next performs a collect-up phase 164. The collect-up phase 164 traverses the tree 𝒯 bottom up, level by level, where each step in this traversal considers a node v and its children u ∈ ch(v). The crux of the collect-up phase 164 is a manipulation of the parent-node distribution q_v in such a way that the sum Σ_{l:B_l ⊆ B_v} λ_l is replaced by a single λ_l = λ_{l_v*} of reference, where the index l_v* emphasizes that it is a reference-λ that has been selected at node v. In particular, as described below, the collect-up phase will transform the representation in (13) into a representation

$q_v = \exp\!\left(\frac{\lambda_{l_v^*}}{|B_{l_v^*}|} - 1\right) C_v \exp\!\left(\frac{K_v}{|B_v|}\right) \qquad (14)$

by deriving the equality

$\sum_{l:\,B_l \subseteq B_v} \lambda_l = |B_v|\,\frac{\lambda_{l_v^*}}{|B_{l_v^*}|} + K_v, \qquad (15)$

assuming a similar representation for each child-node distribution q_u, u ∈ ch(v). Note that node distributions at leaf nodes only involve one λ and therefore trivially obey this condition, so the traversal starts one level above the leaves. The transformation starts from the equality (recall that u ∈ ch(v))

$\sum_{l:\,B_l \subseteq B_v} \lambda_l = \sum_{u} \sum_{l:\,B_l \subseteq B_u} \lambda_l = \sum_{u} \left(|B_u|\,\frac{\lambda_{l_u^*}}{|B_{l_u^*}|} + K_u\right), \qquad (16)$

which implies that the sum Σ_{l:B_l ⊆ B_v} λ_l involved in the parent-node distribution q_v can be reduced to an expression involving just the λ_{l_u*} of reference for each of the |ch(v)| children. In order to transform equation (16) into one that involves only the single λ_{l_v*} of reference, the following procedure is applied.

Let l ↦ r denote the path of nodes (l = w_1, w_2, . . . , w_n = r) from leaf l to root r in 𝒯, and let w_i ↦ w_j, 1 ≤ i ≤ j ≤ n, denote a subpath of l ↦ r. With the notation $q_{w_i \mapsto w_j} = \sum_{w \in w_i \mapsto w_j} q_w$, the constraints in equation (5) can be written as

$q_{l \mapsto u} + q_{v \mapsto r} = 1 \quad \text{for all leaves } l, \qquad (17)$

where u ∈ ch(v) and v ∈ V are both on the path l ↦ r. Since all children share the same path v ↦ r from their parent to the root, the constraints in (17) imply equality of all q_{l↦u}, u ∈ ch(v). In particular, we can ensure that for each u ∈ ch(v) there exists a path where the reference-λs for all nodes on the path are the same. (E.g., we can always choose the reference-λ associated with the left-most child at each step in the traversal.) We can therefore construct

$q_{l_u^* \mapsto u} = \exp\!\left(\frac{\lambda_{l_u^*}}{|B_{l_u^*}|} - 1\right) D_u, \qquad (18)$

where

$D_u = \sum_{w \in l_u^* \mapsto u} C_w \exp\!\left(\frac{K_w}{|B_w|}\right). \qquad (19)$

Thus, the condition for Corollary 2 is satisfied for all q_{l_u* ↦ u}, u ∈ ch(v), which allows selection of one of the λ_{l_u*} as the λ_{l_v*} of reference and representation of each λ_{l_u*} as

$\lambda_{l_u^*} = |B_{l_u^*}|\left(\frac{\lambda_{l_v^*}}{|B_{l_v^*}|} + \log D_{u^*} - \log D_u\right), \qquad (20)$

where u* denotes the child with the chosen λ_{l_{u*}*} = λ_{l_v*} of reference. Substituting (20) into (16), and using Σ_u |B_u| = |B_v|, gives

$\sum_{l:\,B_l \subseteq B_v} \lambda_l = \sum_{u}\left(|B_u|\left(\frac{\lambda_{l_v^*}}{|B_{l_v^*}|} + \log D_{u^*} - \log D_u\right) + K_u\right) = |B_v|\,\frac{\lambda_{l_v^*}}{|B_{l_v^*}|} + |B_v|\,\log D_{u^*} + \sum_{u}\left(-|B_u|\,\log D_u + K_u\right) \overset{\Delta}{=} |B_v|\,\frac{\lambda_{l_v^*}}{|B_{l_v^*}|} + K_v, \qquad (21)$

which is the desired equality in equation (15), leading to the representation of q_v as in equation (14) by substituting into equation (13). The algorithm is now ready for the next step in the collect-up phase 164.

A collect-up step is summarized by the following updates

$K_v \leftarrow |B_v|\,\log D_{u^*} + \sum_{u}\left(-|B_u|\,\log D_u + K_u\right), \qquad (22)$

$D_v \leftarrow C_v \exp\!\left(\frac{K_v}{|B_v|}\right) + D_{u^*}, \qquad (23)$

which are induced from equations (21) and (19).

The algorithm 160 then performs a distribute-down phase 166. After the collect-up traversal, each node in 𝒯 is associated with a particular reference-λ, and these reference-λs are related as in equation (20). The distribute-down phase 166 traverses 𝒯 top down, where each step uses the reference-λ in the parent node to resolve the reference-λs associated with its child nodes. Notice that the update is redundant for the particular child with reference-λ chosen as the reference-λ for the parent in the collect-up traversal, where λ_{l_u*} = λ_{l_v*}. The root node starts the recursion and needs special treatment. The constraints in (17) imply that

$q_{l_r^* \mapsto r} = \exp\!\left(\frac{\lambda_{l_r^*}}{|B_{l_r^*}|} - 1\right) D_r = 1,$

which can be solved to obtain

$\lambda_{l_r^*} = |B_{l_r^*}|\left(1 - \log D_r\right). \qquad (24)$

To finalize, with λ_l available for all leaves l, all q_{B_k} can be determined by simply inserting the appropriate Σ λ_l into equation (10). Alternatively, it is more efficient to update q_{B_k} during each step in the distribute-down phase 166, since λ_{l_v*} is available at this point, and Σ λ_l can therefore be computed using equation (15).
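Putting the three phases together, the E-step might be organized as in the following sketch, which is one possible reading of equations (12)-(24) over an MPT built as in the earlier sketches. It assumes each node carries the illustrative fields n (block size |B_v|), marks (the set 𝒦_v), g (a mapping from component k to g_{B_k}, e.g., computed with gaussian_g_block), and mpt_children, that the MPT has been pruned so every leaf is marked, and that the left-most child serves as the reference child.

```python
# Sketch: CD-EM E-step as a collect-up / distribute-down pass over the MPT.
import numpy as np

def postorder(node):
    """Children before parents, so one list gives a bottom-up traversal."""
    out = []
    for child in node.mpt_children:
        out += postorder(child)
    out.append(node)
    return out

def cd_e_step(root, pi):
    nodes = postorder(root)
    # Initialization phase 162, equation (12).
    for v in nodes:
        v.K = 0.0
        v.C = sum(pi[k] * np.exp(v.g[k]) for k in v.marks)
        v.D = v.C
        v.ref = v                                   # reference leaf l_v*; leaves reference themselves
    # Collect-up phase 164, equations (22)-(23).
    for v in nodes:
        if v.mpt_children:
            u_star = v.mpt_children[0]              # left-most child as reference child u*
            v.ref = u_star.ref
            v.K = v.n * np.log(u_star.D) + sum(-u.n * np.log(u.D) + u.K
                                               for u in v.mpt_children)
            v.D = v.C * np.exp(v.K / v.n) + u_star.D
    # Distribute-down phase 166, equations (24) and (20), then (15) and (10).
    root.lam = root.ref.n * (1.0 - np.log(root.D))  # lambda for the root's reference leaf
    q = {}
    for v in reversed(nodes):                       # parents before children
        for u in v.mpt_children:
            u.lam = u.ref.n * (v.lam / v.ref.n
                               + np.log(v.mpt_children[0].D) - np.log(u.D))
        for k in v.marks:                           # q_{B_k} for every block marked at v
            q[(v, k)] = np.exp(v.lam / v.ref.n + v.K / v.n - 1.0) * pi[k] * np.exp(v.g[k])
    return q
```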

Finally, additional efficiency can be gained in the variational CD-EM E-step by cutting unmarked nodes (without marked ancestors) from the top of 𝒯. This creates a forest of MPTs. Each of these trees can now be treated independently during the estimation of the variational distribution, by noticing that (17) holds true at the root of each new tree, because q_v = 0 for any unmarked node v in 𝒯.

CD-EM: Efficient Variational M-Step

In the variational CD-EM M-step the model parameters θ = {η_{1:K}, π_{1:K}} are updated by maximizing equation (7) with respect to θ under the constraint Σ_k π_k = 1. Hereby, the updates are

$\pi_k \propto \sum_{B_k \in \mathcal{B}_k} |B_k|\, q_{B_k}, \qquad (25)$

$\eta_k = \arg\max_{\eta_k} \sum_{B_k \in \mathcal{B}_k} |B_k|\, q_{B_k}\, g_{B_k}. \qquad (26)$

Notably, the model update in the CD-EM M-step can be efficiently computed, because the same pre-computed sufficient statistics used for updating the variational distribution in the CD-EM E-step can be used in the above update equations.

Example: If p(x | η_k) has the exponential family form in equation (9), then the M-step update for η_k is obtained by solving

$\eta_k = \arg\max_{\eta_k}\ \Big(\sum_{B_k \in \mathcal{B}_k} q_{B_k} \sum_{x \in B_k} T(x)\Big)^{\!\top} \eta_k - \Big(\sum_{B_k \in \mathcal{B}_k} |B_k|\, q_{B_k}\Big) A(\eta_k).$

In particular, if p(x | η_k) = 𝒩_d(μ_k, Σ_k), then

$\Sigma_k = \Big(\sum_{B_k \in \mathcal{B}_k} |B_k|\, q_{B_k} \left\langle x x^{\top}\right\rangle_{B_k}\Big)\big/(N\pi_k) - \mu_k\mu_k^{\top},$

using the cached block statistics ⟨x x^⊤⟩_{B_k} and with π_k normalized as in equation (25).
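The Gaussian M-step might then be sketched as follows, again using only block-level quantities; blocks[k] is assumed (for illustration) to be a list of tuples (|B|, q_B, ⟨x⟩_B, ⟨x x^⊤⟩_B) for component k's current partition, and the mean update shown alongside the covariance is the standard weighted-moment estimate under the same assumptions.

```python
# Sketch: CD M-step for Gaussian components from block sizes, block
# responsibilities, and cached block moments (equations (25)-(26)).
import numpy as np

def cd_m_step_gaussian(blocks, N, d):
    K = len(blocks)
    pi = np.zeros(K)
    mu = np.zeros((K, d))
    Sigma = np.zeros((K, d, d))
    for k in range(K):
        weights = [size * q for size, q, _, _ in blocks[k]]   # |B| * q_B per block
        Nk = sum(weights)                                     # effective count = N * pi_k
        pi[k] = Nk / N
        mu[k] = sum(w * m for w, (_, _, m, _) in zip(weights, blocks[k])) / Nk
        second = sum(w * S for w, (_, _, _, S) in zip(weights, blocks[k])) / Nk
        Sigma[k] = second - np.outer(mu[k], mu[k])
    return pi, mu, Sigma
```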

CD-EM: Variational R-Step

FIG. 5 shows a refining step (R-step) algorithm 190. Algorithm 190 shows a strategy for refining the partitions. Given the current component-dependent data partitions, represented by 𝒯, the R-step selectively refines the partitions for some mixture components. Any R-step enlarges the family of variational distributions, and therefore the optimal lower bound for the log-likelihood obtained at the refined partitions is at least as tight as for the coarser partitions. Hence, running variational EM after any R-step is guaranteed to increase the bound, as also shown in the CHUNKY EM section above. Regarding use of the priority queue, refinement candidates (the queue elements) are inserted into the priority queue with priority according to their ΔF score, as described below. At the end of the R-step, the K best candidates are chosen by repeatedly obtaining from the queue the element that has the highest priority.

Refining one data block in a partition for one component will be considered to be a refinement unit. The overall efficiency of CD-EM is affected by the number of refinement units performed at each R-step. With too few units, too much time might be spent on refining, and with too many units, the choices might be too far from optimal and therefore might unnecessarily slow down the E-step. Empirical analysis has shown that K refinement units at each R-step is a balanced choice. The choice of K refinement units introduces K new free variational parameters, similar to a refinement step in chunky EM. However, in chunky EM the refinement units are evenly distributed across components, which is not the case with CD-EM.

Ideally, a CD-EM R-step should select the refinements leading to the optimal improvement for ℱ. Candidates can be found by performing a single E-step for each candidate and then selecting the refinements that most improve ℱ. Each of the Σ_k M_k current data blocks is to be evaluated, thus the evaluation of each candidate should be efficient. However, the intertwined nature of the constraints on the variational distribution prevents local E-step computations, since all q_{B_k}s are tied together. As an approximation, consider the following local computation scheme, based on the idea that refining a data block for a specific component mostly affects the q_{B_k}s with similar local block structure.

Recall that a data block in a partition is associated with a mark k ∈ 𝒦_v in a node v ∈ 𝒯. Consider a marked node v in 𝒯 and move all marks in v to its children (each child receiving a copy), producing an altered MPT; let 𝒦_u denote the set of marks for u, u ∈ ch(v), in the altered MPT. Now, update the variational distribution q under the following constraints

$\sum_{k \in \mathcal{K}_u} q_{B_k} + q_{\mathrm{rest}} = 1 \quad \text{for all } u \in \mathrm{ch}(v), \qquad (27)$

where q_rest represents the fixed values obtained for the components not in 𝒦_u before the refinement. Note that these constraints are not intertwined, and the updated q can therefore be computed locally in closed form, as in equation (11), with fixed q_rest. Finally, the improvement to ℱ (denoted Δℱ_{v,k}) that is used to rank the refinement candidates only involves the computation of ℱ(θ, q_{B_k}) for the original and altered component-dependent data blocks in a refinement, so it can also be computed efficiently. Going through each marked node in the MPT 𝒯 and computing candidate refinement improvements in this way leads to a computationally efficient ranking of all the candidate refinement units according to their Δℱ_{v,k} scores.
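The candidate ranking and selection of the K best refinement units could be organized around a priority queue as described above. In this sketch, score_refinement stands in for the local Δℱ_{v,k} computation (it is not defined here), and candidates are assumed to be (node, component) pairs enumerated from the marked nodes of the MPT.

```python
# Sketch: rank refinement candidates by their Delta-F scores in a max-priority
# queue (heapq with negated scores) and pop the K best.
import heapq
import itertools

def select_refinements(candidates, score_refinement, K):
    """candidates: iterable of (node, k) pairs; returns the K highest-scoring."""
    heap = []
    order = itertools.count()                 # tie-breaker so nodes are never compared
    for node, k in candidates:
        delta_F = score_refinement(node, k)   # local, closed-form improvement estimate
        heapq.heappush(heap, (-delta_F, next(order), node, k))
    best = []
    while heap and len(best) < K:
        _, _, node, k = heapq.heappop(heap)
        best.append((node, k))
    return best
```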

Overall Algorithm

FIG. 6 shows a detailed CD-EM algorithm 210 incorporating the E-step algorithm 160, the M-step 212, and the R-step algorithm 190. Starting initialization 214 from a coarse partition for each component, the CD-EM algorithm 210 runs variational EM to convergence. The component-dependent partitions are then selectively refined by moving the marks for some components downward in 𝒯, and this process continues until convergence is achieved and further refinements will not significantly improve the lower bound.

More specifically, from beginning to end with algorithm 210: after the hierarchical data structure is initialized, component-dependent data partitions are formed. These component-dependent data partitions may be different for different mixture components. Then, given the component-dependent data partitions, variational EM is run (from the current parameter estimate) until convergence. Each E-step/M-step iteration includes the variational CD E-step, with computation of the approximating variational distribution that treats all data in a component-dependent data chunk as similar for that mixture component. The CD E-step algorithm 160 provides scalability in both N and C, where C is the number of mixture components and N is the number of data items. The variational CD M-step is then performed, which updates the current parameter estimate for the model based on the variational distribution computed in the CD E-step. The CD E-step and CD M-step are iterated until convergence. Then, the R-step as shown in algorithm 190 is performed, which refines the component-dependent data partitioning. Unless convergence is reached, the CD E-step and CD M-step iteration is again repeated, and then the R-step is again performed.
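The overall control flow might be captured by a small driver like the one below; each step is passed in as a callable (so the sketch stays self-contained), and the stopping rule is the one described above: stop when an R-step no longer yields a meaningful improvement in the lower bound reached by the subsequent EM iterations.

```python
# Sketch: overall CD-EM loop of algorithm 210, with the E-step, M-step, R-step
# and bound evaluation supplied by the caller.
def cd_em(state, e_step, m_step, r_step, lower_bound, tol=1e-4):
    outer_bound = float("-inf")
    while True:
        inner_bound = float("-inf")
        while True:                           # variational EM on the current partitions
            q = e_step(state)
            state = m_step(state, q)
            bound = lower_bound(state, q)
            if bound - inner_bound < tol:
                break
            inner_bound = bound
        if bound - outer_bound < tol:         # refining no longer helps significantly
            return state
        outer_bound = bound
        state = r_step(state)                 # refine the component-dependent partitions
```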

Given the CD-EM approach, any type of clustering task can be computed. For instance, email can be classified as spam, a set of environmental measurements can be used to classify a weather condition, a piece of data can be classified as belonging to a particular cluster (e.g., a document as coming from a particular author), and so on. Moreover, the approach will work well with a large training data set, and will also be scalable in the face of very large numbers of clusters (mixture components).

CONCLUSION

FIG. 7 shows a computer 230. The computer 230 may run any of the embodiments described, either in entirety or in part (e.g., in the case of distributed execution). The computer 230 may include some form of storage 232 working in cooperation with processor(s) 234. The storage 232 may include on-chip memory of processor(s) 234, recordable storage media, volatile memory, non-volatile memory, or any other current or future means for storing digital information in a form readily usable by a computer.

With the description above, one will be able to implement the embodiments and features discussed above using ordinary software development tools. The embodiments, as stored in storage 232, may take different forms such as machine executable instructions (e.g., compiled executable binary code), source code, bytecode, or any other information that can be used to enable or configure computing devices to perform the various embodiments discussed above. This is also deemed to include at least volatile memory such as RAM and/or virtual memory storing information such as CPU instructions during execution of a program carrying out an embodiment, as well as non-volatile media storing information that allows a program or executable to be loaded and executed. The embodiments and features can be performed on any type of computing device, including portable devices, workstations, servers, mobile wireless devices, and so on.

1. A method of performing a variational EM algorithm on a computing device to compute a mixture model from data items in computer storage, the method comprising: generating a hierarchical data partition of the data items; initializing component-dependent data partitions of the data items in the hierarchical data partition that are specific to the mixture components; performing a variational Expectation Maximization (EM) algorithm until convergence is determined, the performing comprising iteratively performing: a variational E-step that computes a variational distribution by treating all data in a component-dependent data partition as similar for a mixture component, wherein at least two of the component-dependent data partitions are different for different mixture components; and a variational M-step where a current parameter estimate for the mixture model is updated based on the variational distribution.
2. A method according to claim 1, wherein the performing of the variational EM algorithm further comprises a variational R-step that refines at least one of the component-dependent data partitions.

3. A method according to claim 2, wherein until convergence is reached, the following is performed iteratively: the variational E-step and the variational M-step are performed iteratively, and then the variational R-step is performed.

4. A method according to claim 3, wherein the variational E-step and the variational M-step are not run to convergence between performances of the R-steps.
5. A method according to claim 1, wherein generating the hierarchical data partition comprises computing a KD-tree or an n-ary tree, and where nodes of such a tree contain sufficient statistics of data items that correspond to the nodes.

6. A method according to claim 5, where the sufficient statistics are geometric means.
7. One or more computer-readable storage storing digital information that enables a computing device to perform a process of creating a mixture model from data items, the process comprising: accessing the data items from storage of the computing device; partitioning the data items into a set of blocks and generating a hierarchical data structure comprised of nodes, where each node corresponds to one of the blocks and stores sufficient statistics computed from the data items in the corresponding block; performing a variational EM algorithm to compute cluster-membership probabilities, where the variational EM algorithm is performed by constraining the membership probabilities such that for a cluster marked at a particular node all the data items associated with that node behave the same way for that cluster, where the EM algorithm is repeated and then an R-step is performed to refine the partitions.
8. One or more computer-readable storage according to claim 7, wherein mixture components in the mixture model are from the class of exponential family distributions.

9. One or more computer-readable storage according to claim 8, wherein the components in the mixture model comprise multivariate Gaussians.

10. One or more computer-readable storage according to claim 7, wherein the hierarchical data structure comprises a marked partition tree (MPT) and the E-step comprises traversing the MPT to optimize a variational distribution function.
12. One or more computer-readable storage according to claim 10, wherein the E-step solves a constrained optimization problem with nested constraints, the E-step comprising: traversing the MPT bottom up, level by level, and gradually reducing the nested constraints to a single equation involving only one Lagrange multiplier; solving this equation to determine the Lagrange multiplier; traversing the MPT top down, level by level, and gradually resolving the remaining Lagrange multipliers by back-substitution of previously resolved Lagrange multipliers; and using the resolved Lagrange multipliers to determine the optimal variational distribution.

13. One or more computer-readable storage according to claim 12, wherein the hierarchical data structure comprises a marked partition tree (MPT) and if the MPT has any unmarked nodes without marked ancestors, these unmarked nodes are removed from the MPT to create a plurality of MPTs, which are then treated independently during estimation of the variational distribution function.
14. One or more computer-readable storage according to claim 13, wherein one or more blocks in one or more component-dependent partitions are refined at the R-step.

15. One or more computer-readable storage according to claim 7, wherein the R-step is performed after respective executions of the EM algorithm, wherein K blocks are refined each time the R-step is performed, where K is the number of mixture components in the mixture model.
16. One or more computer-readable storage according to claim 7, where refinement units in the R-step are chosen from a ranking of candidate refinement units, this ranking of candidate refinement units comprising: for each refinement unit, performing a variational E-step for the current component-dependent partitions refined with the candidate refinement unit; and ranking the candidate refinement units according to a scoring metric being the lower bound for the log-likelihood obtained after the E-step.

17. One or more computer-readable storage according to claim 16, wherein the hierarchical data structure comprises a marked partition tree (MPT), wherein a candidate refinement unit is associated with a mark in the MPT; and wherein the variational E-step in the ranking process is performed by a local computation scheme that only allows updates to the variational distribution with respect to elements of the variational distribution associated with the node and its children as defined by the refinement unit.
18. A method performed by one or more computing devices, the method comprising: performing a variational EM algorithm on a set of data items based on a mixture model comprised of mixture components, wherein the variational E-step of the variational EM algorithm is performed on component-dependent partitions of the data items in a hierarchical data structure, a component-dependent partition being dependent on a corresponding component of the mixture model.

19. A method according to claim 18, wherein the method begins with a coarse initialization of the component-dependent partitions.

20. A method according to claim 19, wherein, starting from initial component-dependent partitions, the following is repeated until convergence: the variational E-step and a variational M-step are repeated one or more times followed by an R-step.